Validity and the Social Dimension
Sina Ashrafi
Validity and the Social Dimension of Language Testing
In what ways is the social dimension of language assessment reflected in current theories of the validation of language tests? We will consider the theories of validation that have most influenced current thinking in our field: the work of Messick (following Cronbach) and his successors Mislevy and Kane.
Cronbach
Contemporary discussions of validity in educational assessment are heavily influenced by the thinking of the American Lee Cronbach; it is only a slight overstatement to call Cronbach the “father” of construct validity. Cronbach's language of “trait” and “underlying quality” frames the target of validation in terms of individuality and cognition. There is also clear recognition that validity is not a mathematical property like discrimination or reliability, but a matter of judgment.
Cronbach (1989) emphasized the need for a validity argument, which focuses on collecting evidence for or against a certain interpretation of test scores. In other words, construct validation work is concerned with the validity of inferences rather than the validity of instruments. In fact, Cronbach argued that there is no such thing as a “valid test,” only more or less defensible interpretations: “One does not validate a test, but only a principle for making inferences” (Cronbach & Meehl, 1955, p. 297); “One validates not a test, but an interpretation of data arising from a specified procedure” (Cronbach, 1971, p. 447). Cronbach highlighted the role of beliefs and values in validity arguments, which “must link concepts, evidence, social and personal consequences, and values” (Cronbach, 1988, p. 4).
Messick
Messick, like Cronbach, saw assessment as a process of reasoning and evidence gathering carried out so that inferences can be made about individuals, and saw establishing the meaningfulness and defensibility of those inferences as the primary task of assessment development and research. This reflects an individualist, psychological tradition of measurement concerned with fairness. He introduced the social more explicitly into this picture by arguing two things: first, that our conceptions of what we are measuring, and the things we prioritize in measurement, reflect values, which we can assume are social and cultural in origin; and second, that tests have real effects in the educational and social contexts in which they are used, and that these effects need to be matters of concern for those responsible for the test.
Mislevy
Mislevy has developed an approach called Evidence Centered Design, which focuses on the chain of reasoning in designing tests. Step 1 involves the test designer in articulating the claims the test will make about candidates on the basis of test performance. Step 2 involves determining the kind of evidence that would be necessary to support the claims established in step 1. Step 3 involves defining in general terms the kinds of task in which the candidate will be required to engage so that the evidence set out in step 2 might be sought. All three steps precede the actual writing of specifications for test tasks; they constitute the “thinking stage” of test design. Only when this chain of reasoning is completed can the specifications for test tasks be written.
Kane
Kane has also developed a systematic approach to thinking through the process of drawing valid inferences from test scores. Kane points out that we interpret scores as having meaning. The same score might have different interpretations.
Whatever the interpretation we choose, Kane argues, we need an argument to defend the relationship of the score to that interpretation. He calls this an interpretative argument, defined as a “chain of inferences from the observed performances to conclusions and decisions included in the interpretation” (Kane, Crooks, & Cohen, 1999, p. 6).
Kane proposes four types of inference in the chain of inferences constituting the interpretative argument. The first inference is from observation to observed score. In order for assessment to be possible, an instance of learner behavior needs to be observable. The second inference is from the observed score to what Kane called the universe score, deliberately using terminology from generalizability theory (Brennan, 2001). This inference is that the observed score is consistent across tasks, judges, and occasions. Kane pointed out that generalization across tasks is often poor in complex performance assessments: the fact that a person can handle a complex writing task involving one topic and supporting stimulus material does not necessarily mean that the person will perform in a comparable way on another topic and another set of materials. If task generalizability is weak, or if the impact of raters or the rating process is large, we cannot move on in the interpretative argument. The third type of inference, from the universe score to the target score, is that commonly dealt with under the heading of construct validity. The fourth type of inference, from the target score to the decision based on the score, moves the test into the world of test use and test context.
Kane thus distinguishes two types of inference (semantic inferences and policy inferences) and two related types of interpretation. Interpretations that only involve semantic inferences are called descriptive interpretations; interpretations involving policy inferences are called decision-based interpretations.
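
For readers who find a schematic helpful, the chain of inferences described above can be sketched as a small data structure. This is only an illustrative sketch and not anything Kane proposes: the class name `Inference`, the functions `build_chain`, `furthest_defensible_point`, and `interpretation_type`, and the boolean `supported` flag are all hypothetical. Labeling the first three links as semantic and the fourth as a policy inference is our reading of the summary above. The sketch shows two points from the text: a weak link stops the argument from moving on, and an interpretation is decision-based as soon as it includes a policy inference.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Inference:
    source: str      # where the inference starts
    target: str      # what the inference concludes
    kind: str        # "semantic" or "policy" (our reading of the text)
    supported: bool  # hypothetical flag: is there adequate evidence for this link?


def build_chain(supported_flags: List[bool]) -> List[Inference]:
    """Build the four-link chain, one support flag per link."""
    steps = [
        ("observation", "observed score", "semantic"),
        ("observed score", "universe score", "semantic"),
        ("universe score", "target score", "semantic"),
        ("target score", "decision", "policy"),
    ]
    return [Inference(s, t, k, ok) for (s, t, k), ok in zip(steps, supported_flags)]


def furthest_defensible_point(chain: List[Inference]) -> str:
    """Walk the chain in order and stop at the first unsupported link."""
    reached = chain[0].source
    for link in chain:
        if not link.supported:
            break
        reached = link.target
    return reached


def interpretation_type(chain: List[Inference]) -> str:
    """Descriptive if only semantic inferences are involved, else decision-based."""
    if any(link.kind == "policy" for link in chain):
        return "decision-based"
    return "descriptive"


# Weak generalization across tasks or raters: the second link is unsupported.
weak = build_chain([True, False, True, True])
print(furthest_defensible_point(weak))  # observed score

full = build_chain([True, True, True, True])
print(furthest_defensible_point(full))  # decision
print(interpretation_type(full))        # decision-based
print(interpretation_type(full[:3]))    # descriptive
```

In the first example the argument stalls at the observed score, which mirrors the point that weak task generalizability or a large rater effect prevents us from moving on in the interpretative argument.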